Information Extraction Using the Structured Language Model 



Ciprian Chelba and Milind Mahajan 

Microsoft Research 
Microsoft Corporation 
One Microsoft Way, Redmond, WA 98052 
{chelba, milindm}@microsoft.com



Abstract 

The paper presents a data-driven approach to infor- 
mation extraction (viewed as template filling) using 
the structured language model (SLM) as a statistical 
parser. The task of template filling is cast as con- 
strained parsing using the SLM. The model is auto- 
matically trained from a set of sentences annotated 
with frame/slot labels and spans. Training pro- 
ceeds in stages: first a constrained syntactic parser 
is trained such that the parses on training data meet 
the specified semantic spans, then the non-terminal 
labels are enriched to contain semantic information 
and finally a constrained syntactic+semantic parser 
is trained on the parse trees resulting from the pre- 
vious stage. Despite the small amount of training 
data used, the model is shown to outperform the 
slot level accuracy of a simple semantic grammar 
authored manually for the MiPad — personal infor- 
mation management — task. 

1 Introduction 

Information extraction from text can be character-
ized as template filling (Jurafsky and Martin, 2000):
a given template or frame contains a certain num- 
ber of slots that need to be filled in with segments 
of text. Typically not all the words in text are rele- 
vant to a particular frame. Assuming that the seg- 
ments of text relevant to filling in the slots are non- 
overlapping contiguous strings of words, one can rep- 
resent the semantic frame as a simple semantic parse 
tree for the sentence to be processed. The tree has 
two levels: the root node is tagged with the frame 
label and spans the entire sentence; the leaf nodes 
are tagged with the slot labels and span the strings 
of words relevant to the corresponding slot. 

Consider the semantic parse S for a sentence W 
presented in Fig. 1. CalendarTask is the frame tag,

(CalendarTask schedule meeting with 

(ByFullName*Person megan hokins) about 

(SubjectByWildCard*Subject internal lecture) 
at (PreciseTime*Time two thirty p.m.)) 

Figure 1: Sample sentence and semantic parse 



spanning the entire sentence; the remaining ones are 
slot tags with their corresponding spans. 
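The two-level semantic parse described above can be held in a very small data structure. The sketch below is illustrative only: the class names `Slot` and `SemanticParse`, the 0-based inclusive word indices, and the `bracketed` helper are assumptions made for this example and are not part of the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    label: str  # slot tag, e.g. "PreciseTime*Time"
    start: int  # index of the first word of the span (0-based, inclusive)
    end: int    # index of the last word of the span (inclusive)


@dataclass(frozen=True)
class SemanticParse:
    frame: str               # frame tag; it spans the entire sentence
    words: Tuple[str, ...]   # the word sequence W
    slots: Tuple[Slot, ...]  # non-overlapping contiguous spans

    def bracketed(self) -> str:
        """Render the two-level tree in the bracketed style of Figure 1."""
        starts = {s.start: s for s in self.slots}
        ends = {s.end for s in self.slots}
        out = []
        for i, w in enumerate(self.words):
            if i in starts:
                out.append("(" + starts[i].label)
            out.append(w + (")" if i in ends else ""))
        return "(" + self.frame + " " + " ".join(out) + ")"


sentence = "schedule meeting with megan hokins about internal lecture at two thirty p.m."
fig1 = SemanticParse(
    frame="CalendarTask",
    words=tuple(sentence.split()),
    slots=(
        Slot("ByFullName*Person", 3, 4),
        Slot("SubjectByWildCard*Subject", 6, 7),
        Slot("PreciseTime*Time", 9, 11),
    ),
)
print(fig1.bracketed())
```

Printing `fig1.bracketed()` reproduces the bracketing of Figure 1 on a single line.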

In the MiPad scenario (Huang et al., 2000), es-
sentially a personal information management (PIM)
task, there is a module that is able to convert
the information extracted according to the semantic 
parse into specific actions. In this case the action is 
to schedule a calendar appointment. 

We view the problem of information extraction as 
the recovery of the two-level semantic parse S for a 
given word sequence W. 

We propose a data-driven approach to information
extraction that uses the structured language model
(SLM) (Chelba and Jelinek, 2000) as an automatic
parser. The parser is constrained to explore only
parses that contain pre-set constituents, each spanning
a given word string and bearing a tag in a given
set of semantic tags. The constraints available dur-
ing training and test are different, the test case con-
straints being more relaxed, as explained in Section 4.

The main advantage of the approach is that it 
doesn't require any grammar authoring expertise. 
The approach is fully automatic once the annotated 
training data is provided; it does assume that an 
application schema (i.e. frame and slot structure)
has been defined but does not require seman-
tic grammars that identify word-sequence to slot or
frame mapping. However, the process of convert-
ing the word sequence corresponding to a slot into
actionable canonical forms (e.g. converting half past
two in the afternoon into 2:30 p.m.) may require
grammars. The design of the frames (what infor-
mation is relevant for taking a certain action, what
slot/frame tags are to be used; see (Wang, 1999))
is a delicate task that we will not be concerned with
for the purposes of this paper.

The remainder of the paper is organized as fol-
lows: Section 2 reviews the structured language
model (SLM), followed by Section 3, which describes
in detail the training procedure, and Section 4, which
defines the operation of the SLM as a constrained
parser and presents the necessary modifications to
the model. Section 5 compares our approach to oth-
ers in the literature, in particular that of (Miller et
al., 2000). Section 6 presents the experiments we
have carried out. We conclude with Section 7.

2 Structured Language Model 

We proceed with a brief review of the structured
language model (SLM); an extensive presentation
of the SLM can be found in (Chelba and Jelinek,
2000). The model assigns a probability P(W, T)
to every sentence W and its every possible binary
parse T. The terminals of T are the words of W
with POStags, and the nodes of T are annotated
with phrase headwords and non-terminal labels. Let
W be a sentence of length n words to which we
have prepended the sentence beginning marker <s>
and appended the sentence end marker </s> so that
w_0 = <s> and w_{n+1} = </s>. Let W_k = w_0 ... w_k be
the word k-prefix of the sentence (the words from
the beginning of the sentence up to the current po-
sition k) and W_k T_k the word-parse k-prefix. Fig-
ure 2 shows a word-parse k-prefix; h_0 ... h_{-m}
are the exposed heads, each head being a pair (head-
word, non-terminal label), or (word, POStag) in the
case of a root-only tree. The exposed heads at a
given position k in the input sentence are a function
of the word-parse k-prefix.

[Figure 2 shows the exposed heads h_{-m} = (<s>, SB), ..., h_0 = (h_0.word, h_0.tag)
above the word sequence (<s>, SB) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k) w_{k+1} ... </s>]

Figure 2: A word-parse k-prefix

2.1 Probabilistic Model

The joint probability P(W, T) of a word sequence W
and a complete parse T can be broken into:

P(W, T) = \prod_{k=1}^{n+1} [ P(w_k / W_{k-1}T_{k-1}) \cdot P(t_k / W_{k-1}T_{k-1}, w_k) \cdot
          \prod_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) ]

where:

• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix

• w_k is the word predicted by WORD-
PREDICTOR

• t_k is the tag assigned to w_k by the TAGGER

• N_k - 1 is the number of operations the PARSER
executes at sentence position k before passing
control to the WORD-PREDICTOR (the N_k-th
operation at position k is the null transition);
N_k is a function of T

• p_i^k denotes the i-th PARSER operation carried
out at position k in the word string; the opera-
tions performed by the PARSER are illustrated
in Figures 3-4 and they ensure that all possible
binary branching parses with all possible head-
word and non-terminal label assignments for the
w_1 ... w_k word sequence can be generated. The
p_1^k ... p_{N_k}^k sequence of PARSER operations at
position k grows the word-parse (k-1)-prefix
into a word-parse k-prefix.

[Figure 3 shows the tree after an adjoin-left operation; the new exposed head is
h'_0 = (h_{-1}.word, NTlabel)]

Figure 3: Result of adjoin-left under NTlabel

Figure 4: Result of adjoin-right under NTlabel

Our model is based on three probabilities, each 
estimated using deleted interpolation and parame- 
terized (approximated) as follows: 

P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0, h_{-1})
P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, h_0, h_{-1})
P(p_i^k / W_k T_k) = P(p_i^k / h_0, h_{-1})

It is worth noting that if the binary branching struc- 
ture developed by the parser were always right- 
branching and we mapped the POStag and non- 
terminal label vocabularies to a single type then 
our model would be equivalent to a trigram lan-
guage model. Since the number of parses for a
given word prefix W_k grows exponentially with k,
|{T_k}| ~ O(2^k), the state space of our model is huge
even for relatively short sentences, so we had to use
a search strategy that prunes it. Our choice was a 
synchronous multi-stack search algorithm which is 
very similar to a beam search. 
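To make the factorization of P(W, T) concrete, the following sketch accumulates the log-probability of one derivation from the three component models, each conditioned on the two most recent exposed heads as in the parameterization above. It is a toy illustration under stated assumptions: the function name `derivation_log_prob`, the move encoding (`"adjoin-left"`, `"adjoin-right"`, `"null"`), the padding head, the end-of-sentence tag `SE`, and the constant component probabilities are all hypothetical and are not taken from the SLM implementation.

```python
import math
from typing import Callable, List, Tuple

Head = Tuple[str, str]  # (headword, POStag or non-terminal label)
Move = Tuple[str, str]  # (operation, non-terminal label); label unused for "null"


def derivation_log_prob(
    words: List[str],                 # w_1 ... w_{n+1}, the last one being "</s>"
    tags: List[str],                  # t_1 ... t_{n+1}
    moves_per_word: List[List[Move]], # PARSER moves at each position, ending with "null"
    p_word: Callable[[str, Head, Head], float],
    p_tag: Callable[[str, str, Head, Head], float],
    p_op: Callable[[Move, Head, Head], float],
) -> float:
    heads: List[Head] = [("<s>", "SB")]  # exposed heads; heads[-1] plays the role of h_0
    pad: Head = ("<pad>", "<pad>")       # assumed filler when only one head is exposed

    def context() -> Tuple[Head, Head]:
        return heads[-1], (heads[-2] if len(heads) > 1 else pad)

    logp = 0.0
    for w, t, moves in zip(words, tags, moves_per_word):
        h0, h1 = context()
        logp += math.log(p_word(w, h0, h1))    # WORD-PREDICTOR
        logp += math.log(p_tag(t, w, h0, h1))  # TAGGER
        heads.append((w, t))
        for kind, label in moves:              # PARSER
            h0, h1 = context()
            logp += math.log(p_op((kind, label), h0, h1))
            if kind == "null":
                break
            right = heads.pop()
            left = heads.pop()
            # adjoin-left percolates the left headword, adjoin-right the right one
            headword = left[0] if kind == "adjoin-left" else right[0]
            heads.append((headword, label))
    return logp


# Toy run with constant component probabilities (illustrative values only).
lp = derivation_log_prob(
    words=["the", "dog", "</s>"],
    tags=["DT", "NN", "SE"],
    moves_per_word=[[("null", "")], [("adjoin-right", "NP"), ("null", "")], [("null", "")]],
    p_word=lambda w, h0, h1: 0.1,
    p_tag=lambda t, w, h0, h1: 0.5,
    p_op=lambda m, h0, h1: 0.25,
)
print(lp)
```

In this run there are three word predictions, three tag assignments and four parser moves, so the result is 3 log 0.1 + 3 log 0.5 + 4 log 0.25.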

The language model probability assignment for 
the word at position k + 1 in the input sentence is 
made using: 

P(w_{k+1} / W_k) = \sum_{T_k \in S_k} P(w_{k+1} / W_k T_k) \cdot \rho(W_k T_k),
\rho(W_k T_k) = P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k)        (1)

which ensures a proper probability over strings W*,
where S_k is the set of all parses present in our stacks
at the current stage k.
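Eq. (1) is a weighted average of per-parse word predictions, with weights obtained by normalizing the joint probabilities of the parses kept in the stacks. A minimal sketch follows; the function name `next_word_prob`, the representation of a parse as an opaque identifier, and the toy numbers are assumptions made for illustration.

```python
from typing import Callable, Hashable, List, Tuple


def next_word_prob(
    word: str,
    stack: List[Tuple[Hashable, float]],  # (parse T_k, joint probability P(W_k T_k))
    p_word_given_parse: Callable[[str, Hashable], float],
) -> float:
    """P(w_{k+1} / W_k) as in Eq. (1), summing over the parses in the stack."""
    total = sum(p for _, p in stack)  # normalizer of rho
    return sum((p / total) * p_word_given_parse(word, parse) for parse, p in stack)


# Two surviving parses with hypothetical joint probabilities and predictions.
toy_stack = [("T_a", 0.003), ("T_b", 0.001)]
toy_predictions = {"T_a": {"lecture": 0.2}, "T_b": {"lecture": 0.6}}
p = next_word_prob("lecture", toy_stack, lambda w, t: toy_predictions[t][w])
print(p)
```

Here the interpolation weights are 0.75 and 0.25, so the prediction is 0.75 * 0.2 + 0.25 * 0.6.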



2.2 Model Parameter Estimation 

Each model component — WORD-PREDICTOR, 
TAGGER, PARSER — is initialized from a set 
of parsed sentences after undergoing headword 
percolation and binarization. Separately for each 
model component we: 

• gather counts from "main" data — about 90% 
of the training data 

• estimate the interpolation coefficients on counts 
gathered from "check" data — the remaining 
10% of the training data. 



An N-best EM (Dempster et al., 1977) variant is
then employed to jointly re-estimate the model pa- 
rameters such that the likelihood of the training data 
under our model is increased. 
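The "main"/"check" split used for initialization can be illustrated with a deliberately simplified form of deleted interpolation: a single coefficient that mixes a relative-frequency estimate with a uniform distribution. The actual SLM components interpolate several context orders; the two-component mixture, the function names, and the toy events below are assumptions made for this sketch.

```python
from collections import Counter
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (context, predicted symbol)


def ml_estimator(main: List[Event]) -> Callable[[str, str], float]:
    """Relative-frequency estimate from counts gathered on the "main" data."""
    joint = Counter(main)
    context = Counter(c for c, _ in main)

    def p_ml(c: str, e: str) -> float:
        return joint[(c, e)] / context[c] if context[c] else 0.0

    return p_ml


def estimate_lambda(p_ml, check: List[Event], vocab_size: int, iterations: int = 50) -> float:
    """EM for the interpolation coefficient, using the "check" data only."""
    lam = 0.5
    for _ in range(iterations):
        posterior_sum = 0.0
        for c, e in check:
            specific = lam * p_ml(c, e)
            backoff = (1.0 - lam) / vocab_size
            # E-step: posterior that the event came from the relative-frequency component
            posterior_sum += specific / (specific + backoff)
        lam = posterior_sum / len(check)  # M-step
    return lam


main_data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
check_data = [("a", "x"), ("a", "z"), ("b", "x")]  # ("a", "z") is unseen in main_data
p_ml = ml_estimator(main_data)
lam = estimate_lambda(p_ml, check_data, vocab_size=3)
print(lam)
```

The unseen event in the check data keeps the coefficient strictly below 1, which is the point of estimating it on data that was not used for the counts.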

3 Training Procedure 

This section describes the training procedure for 
the SLM when applied to information extraction 
and introduces the modifications that need to be 
made to the SLM operation. 

The training of the model proceeds in four stages: 

1. initialize the SLM as a syntactic parser for the
domain we are interested in. A general purpose
parser (such as NLPwin (Heidorn, 1999)) can
be used to generate a syntactic treebank from
which the SLM parameters can be initialized.
Another possibility for initializing the SLM is
to use a treebank for out-of-domain data (such
as the UPenn Treebank (Marcus et al., 1993));
see Section 6.1.



2. train the SLM as a matched constrained parser. 
At this step the parser is going to propose a set 
of N syntactic binary parses for a given word 
string (N-best parsing), all matching the con- 
stituent boundaries specified by the semantic 
parse: a parse T is said to match the seman-
tic parse S, denoted T ∋ S, if and only if the
set of un-labeled constituents that define S is 
included in the set of constituents that define 
T. 

At this time only the constituent span informa- 
tion in S is taken into account. 

3. enrich the non-terminal and pre-terminal labels 
of the resulting parses with the semantic tags 
(frame and slot) present in the semantic parse, 
thus expanding the vocabulary of non-terminal 
and pre-terminal tags used by the syntactic 
parser to include semantic information along- 
side the usual syntactic tags. 

4. train the SLM as an L(abel)-matched constrained
parser to explore only the semantic parses for
the training data. This time the semantic con-
stituent labels are taken into account too, which
means that a parse P, containing both syn-
tactic and semantic information, is said to
L(abeled)-match S if and only if the set of la-
beled semantic constituents that defines S is
identical to the set of semantic constituents that
defines P. If we let SEM(P) denote the func-
tion that maps a tree P containing both syntac-
tic and semantic information to the tree con-
taining only semantic information, referred to
as the semantic projection of P, then all the
parses P_i, ∀i < N, proposed by the SLM for a
given sentence W, L-match S and thus satisfy
SEM(P_i) = S, ∀i < N.

The semantic tree S has a two level structure
so the above requirement can be satisfied only if
the parses SEM(P) proposed by the SLM are
also on two levels, frame and slot level respec-
tively. We have incorporated this constraint
into the operation of the SLM; see Section 4.2.
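The match and L-match relations used in steps 2 and 4 reduce to set operations once a parse is viewed as a set of constituents. The sketch below assumes a hypothetical representation in which each constituent carries a span, a syntactic tag and an optional semantic tag; the names `Constituent`, `matches`, `sem` and `l_matches`, and the four-word example, are illustrative and not the authors' code.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Constituent(NamedTuple):
    start: int                 # first word index (inclusive)
    end: int                   # last word index (inclusive)
    syn: str                   # syntactic tag ("" for a purely semantic constituent)
    sem: Optional[str] = None  # semantic tag, None if the node carries none


def spans(tree: Set[Constituent]) -> Set[Tuple[int, int]]:
    return {(c.start, c.end) for c in tree}


def matches(T: Set[Constituent], S: Set[Constituent]) -> bool:
    """Match: the un-labeled constituents of S are included in those of T."""
    return spans(S) <= spans(T)


def sem(P: Set[Constituent]) -> Set[Tuple[int, int, str]]:
    """Semantic projection SEM(P): keep only the semantically tagged constituents."""
    return {(c.start, c.end, c.sem) for c in P if c.sem is not None}


def l_matches(P: Set[Constituent], S: Set[Constituent]) -> bool:
    """L-match: the labeled semantic constituents of P and S are identical."""
    return sem(P) == sem(S)


# "meeting at two thirty" (word indices 0..3) with one slot over "two thirty".
S = {Constituent(0, 3, "", "CalendarTask"), Constituent(2, 3, "", "PreciseTime*Time")}
T_good = {Constituent(0, 3, "S"), Constituent(1, 3, "PP"), Constituent(2, 3, "NP")}
T_bad = {Constituent(0, 3, "S"), Constituent(0, 2, "NP"), Constituent(0, 1, "NP")}
P = {
    Constituent(0, 3, "S", "CalendarTask"),
    Constituent(1, 3, "PP"),
    Constituent(2, 3, "NP", "PreciseTime*Time"),
}
print(matches(T_good, S), matches(T_bad, S), l_matches(P, S))
```

`T_bad` fails to match because it never builds a constituent over "two thirty"; `P` is `T_good` after the label enrichment of step 3.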

The model thus trained is then used to parse
test sentences and recover the semantic parse us-
ing S = SEM(argmax_{P_i} P(P_i, W)). In principle,
one should sum over all the parses P that yield
the same semantic parse S and then choose S =
argmax_S \sum_{P_i s.t. SEM(P_i) = S} P(P_i, W).
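The difference between taking the semantic projection of the single best parse and summing over all parses with the same projection can be seen on a small N-best list. The function names and the toy probabilities below are assumptions for illustration; semantic projections are represented as plain strings.

```python
from collections import defaultdict
from typing import Hashable, List, Tuple

NBest = List[Tuple[Hashable, float]]  # (SEM(P_i), joint probability P(P_i, W))


def one_best_semantic_parse(nbest: NBest) -> Hashable:
    """S = SEM(argmax_{P_i} P(P_i, W))."""
    return max(nbest, key=lambda item: item[1])[0]


def summed_semantic_parse(nbest: NBest) -> Hashable:
    """S = argmax_S of the summed probability of the parses projecting to S."""
    totals = defaultdict(float)
    for projection, prob in nbest:
        totals[projection] += prob
    return max(totals, key=totals.get)


toy_nbest = [("S_A", 0.40), ("S_B", 0.30), ("S_B", 0.25)]
print(one_best_semantic_parse(toy_nbest), summed_semantic_parse(toy_nbest))
```

With these numbers the single best parse projects to `S_A`, while the summed criterion prefers `S_B` (0.55 against 0.40).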

A few iterations of the N-best EM variant (see
Section 2.2) were run at each of the second and
fourth step in the training procedure. The con- 
strained parser operation makes this an EM variant 
where the hidden space — the possible parse trees 
for a given sentence — is a priori limited by the se- 
mantic constraints to a subset of the hidden space 
of the unrestricted model. At test time we wish to 
recover the most likely subset of the hidden space 
consistent with the constraints imposed on the sen- 
tence. 

To be more specific, during the second training 
stage, the E-step of the reestimation procedure will 
only explore syntactic trees (hidden events) that 
match the semantic parse; the fourth stage E-steps 
will consider hidden events that are constrained even 
further to L-match the semantic parse. We have no 
proof that this procedure should lead to better per- 
formance in terms of slot/frame accuracy but intu- 
itively one expects it to place more and more proba- 
bility mass on the desirable trees — that is, the trees 
that are consistent with the semantic annotation. 
This is confirmed experimentally by the fact that
the likelihood of the training word sequence (observ-
able), calculated by Eq. (1) where the sum runs
over the parse trees that match/L-match the seman-
tic constraints, does increase [1] at every training
step, as presented in Section 6, Table 1. However,
the increase in likelihood is not always correlated
with a decrease in error rate on the training data,
see Tables 2 and 3 in Section 6.

1 Equivalent to a decrease in perplexity

4 Constrained Parsing Using the 
Structured Language Model 

We now detail the constrained operation of the SLM
(matched and L-matched parsing) used at the
second and fourth steps of the training procedure
described in the previous section.

A semantic parse S for a given sentence W con- 
sists of a set of constituent boundaries along with 
semantic tags. When parsing the sentence using the 
standard formulation of the SLM, one obtains binary 
parses that are not guaranteed to match the seman- 
tic parse S, i.e. the constituent proposed by the 
SLM may cross semantic constituent boundaries; for 
the constituents matching the semantic constituent 
boundaries, the labels proposed may not be the de- 
sired ones. 

To fix terminology, we define a constrained con-
stituent, or simply a constraint, c to be a span
together with a set [2] Q of allowable tags for the span:
c = <l, r, Q> where l is the left boundary of the
constraint, r is the right boundary of the constraint
and Q is the set of allowable non-terminal tags for
the constraint.

A semantic parse can be viewed as a set of con- 
straints; for each constraint the set of allowable non- 
terminal tags Q contains a single element, respec- 
tively the semantic tag for each constituent. An ad- 
ditional fact to be kept in mind is that the semantic 
parse tree consists of exactly two levels: the frame 
level (root semantic tag) and the slot level (leaf se- 
mantic tags). 

During training, we wish to constrain the SLM op-
eration such that it considers only parses that match
the constraints c_i, i = 1 ... C as it proceeds left to
right through a given sentence W. In light of the
training procedure described in Section 3, we
consider two flavors of constrained parsing, one in
which we only generate parses that match the con-
straint boundaries and another in which we also en-
force that the proposed tag for every matching con-
stituent is among the constrained set of non-terminal
tags c_i.Q (L(abeled)-match constrained parsing).

The only constraints available for the test sen- 
tences are: 

• the semantic tag of the root node, which
spans the entire sentence, must be in the
set of frame tags. If it were a test sen-
tence, the example in Figure 1 would have
the following semantic parse (constraints):
({CalendarTask, ContactsTask, MailTask}
schedule meeting with megan hokins about
internal lecture at two thirty p.m.)

• the semantic projection of the trees proposed
by the SLM must have exactly two levels; this
constraint is built into the operation of the L-
match parser.

The next section will describe the constrained
parsing algorithm. Section 4.2 will describe fur-
ther changes that the algorithm uses to produce
only parses P whose semantic projection SEM(P)
has exactly two levels, frame (root) and slot (leaf)
level, respectively (only in the L-match case). We
conclude with Section 4.3, explaining how the con-
strained parsing algorithm interacts with the prun-
ing of the SLM search space for the most likely parse.

4.1 Match and L-match SLM Parsing 

The trees produced by the SLM are binary trees. 
The tags annotating the nodes of the tree are purely 
syntactic — during the second training stage — or 
syntactic+semantic — during the last training stage 
or at test time. It can be proved that satisfying
the following two conditions at each position k in
the input sentence ensures that all the binary trees
generated by the SLM parsing algorithm match the
pre-set constraints c_i, i = 1 ... C as it proceeds left to
right through the input sentence W = w_0 ... w_{n+1}:

• for a given word-parse k-prefix W_k T_k (see Sec-
tion 2) accept an adjoin transition if and only
if:

1. the resulting constituent does not violate [3]
any of the constraints c_i, i = 1 ... C

2. L-match parsing only: if the seman-
tic projection of the non-terminal tag
SEM(NTtag) proposed by the adjoin op-
eration is non-void then the newly created
constituent must L-match an existing con-
straint, ∃ c_i s.t. SEM(NTtag) ∈ c_i.Q.

• for a given word-parse k-prefix W_k T_k (see Sec-
tion 2) accept the null transition if and only
if all the constraints c_i whose right boundary
is equal to the current word index k, c_i.r = k,
have been matched. If these constraints remain
un-matched they will be broken at a later time
during the process of completing the parse for
the current sentence W: there will be an adjoin
operation involving a constituent to the right
of the current position that will break all the
constraints ending at the current position k.
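The two acceptance conditions above can be sketched as predicates over spans. This is a simplified illustration, not the authors' implementation: the `Constraint` class mirrors c = <l, r, Q>, word indices are 0-based and inclusive, `built_spans` is assumed to hold the spans of all constituents (including single words) already present in the word-parse prefix, and the null-transition check looks at spans only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set, Tuple

Span = Tuple[int, int]


@dataclass(frozen=True)
class Constraint:
    l: int             # left boundary
    r: int             # right boundary
    Q: FrozenSet[str]  # allowable tags for the span (at least one element)


def crosses(span: Span, c: Constraint) -> bool:
    """True if the spans overlap without one containing the other."""
    a, b = span
    return (a < c.l <= b < c.r) or (c.l < a <= c.r < b)


def accept_adjoin(span: Span, sem_tag: Optional[str],
                  constraints: Iterable[Constraint], l_match: bool) -> bool:
    cs = list(constraints)
    if any(crosses(span, c) for c in cs):  # condition 1: no constraint is violated
        return False
    if l_match and sem_tag is not None:    # condition 2: L-match parsing only
        return any((c.l, c.r) == span and sem_tag in c.Q for c in cs)
    return True


def accept_null(k: int, built_spans: Set[Span], constraints: Iterable[Constraint]) -> bool:
    """Allow moving past position k only if every constraint ending at k is matched."""
    return all((c.l, c.r) in built_spans for c in constraints if c.r == k)


# Constraints for the sentence of Figure 1 (12 words, indices 0..11).
cs = [
    Constraint(0, 11, frozenset({"CalendarTask"})),
    Constraint(3, 4, frozenset({"ByFullName*Person"})),
    Constraint(6, 7, frozenset({"SubjectByWildCard*Subject"})),
    Constraint(9, 11, frozenset({"PreciseTime*Time"})),
]
print(accept_adjoin((4, 5), None, cs, l_match=False))                # "hokins about" crosses a slot
print(accept_adjoin((3, 4), "ByFullName*Person", cs, l_match=True))  # matches span and tag
print(accept_null(4, {(3, 3), (4, 4)}, cs))                          # slot (3, 4) not yet built
```

The last call shows why the null transition must be blocked: once the parser moves past "hokins" without having built the constituent over "megan hokins", any later adjoin that includes "hokins" would cross that constraint.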

4.2 Semantic Tag Layering 

The two-layer structure of the semantic trees need
not be enforced during training; simply L-matching
the semantic constraints will implicitly satisfy this
constraint. As explained above, for test sentences
we can only specify the frame level constraint, leav-
ing open the possibility of generating a tree whose
semantic projection would contain more than two
levels (nested slot level constituents). In order
to avoid this, each tree in a given word-parse has
two bits that describe whether the tree already con-
tains a constituent whose semantic projection is a
frame/slot level tag, respectively. An adjoin opera-
tion proposing a tag that violates the correct layer-
ing of frame/slot level tags can now be detected and
discarded.

2 The set of allowable tags must contain at least one ele-
ment

3 A constraint is violated by a constituent if the span of
the constituent crosses the span of the constraint.
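One plausible reading of the two-bit bookkeeping is sketched below. The exact rule used in the SLM is not spelled out in the text, so the rejection conditions here (no slot tag above an existing slot or frame tag, no frame tag above an existing frame tag) are an assumption, as are the names `Layer` and `adjoin_layers`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    has_frame: bool = False  # tree already contains a frame level constituent
    has_slot: bool = False   # tree already contains a slot level constituent


def adjoin_layers(left: Layer, right: Layer, new_tag_level: Optional[str]) -> Optional[Layer]:
    """Combine the bits of two adjoined trees.

    new_tag_level is "frame", "slot" or None (purely syntactic tag).
    Returns None when the proposed tag would break the two-level layering.
    """
    has_frame = left.has_frame or right.has_frame
    has_slot = left.has_slot or right.has_slot
    if new_tag_level == "slot" and (has_slot or has_frame):
        return None  # a slot may not dominate another slot or a frame
    if new_tag_level == "frame" and has_frame:
        return None  # a frame may not dominate another frame
    return Layer(has_frame or new_tag_level == "frame",
                 has_slot or new_tag_level == "slot")


print(adjoin_layers(Layer(), Layer(has_slot=True), "slot"))   # rejected: nested slots
print(adjoin_layers(Layer(has_slot=True), Layer(), "frame"))  # accepted: frame over slot
```

Because the bits are combined with a logical OR, the check costs constant time per adjoin operation regardless of the size of the subtrees.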

4.3 Interaction with Pruning 

In the absence of pruning the search for the most 
likely parse satisfying the constraints for a given 
sentence becomes computationally intractable [4]. In
practice, we are forced to use pruning techniques in 
order to limit the size of the search space. However, 
it is possible that during the left to right traversal of 
the sentence, the pruning scheme will keep alive only 
parses whose continuation cannot meet constraints 
that we have not encountered yet and no complete 
parse for the current sentence can be returned. In 
such cases, we back-off to unconstrained parsing — 
regular SLM usage. In our experiments, we noticed 
that this was necessary for very few training sen- 
tences (1 out of 2,239) and relatively few test sen- 
tences (31 out of 1,101). 

5 Comparison with Previous Work 

The use of a syntactic parser augmented with se-
mantic tags for information extraction from text is
not a novel idea. The basic approach we described
is very similar to the one presented in (Miller et al.,
2000); however, there are a few major differences:

• in our approach the augmentation of the syn-
tactic tags with semantic tags is straightforward
due to the fact that the semantic constituents
are matched exactly [5]. The approach in (Miller
et al., 2000) needs to insert additional nodes
in the syntactic tree to account for the seman-
tic constituents that do not have a correspond-
ing syntactic one. We believe our approach en-
sures tighter coupling between the syntactic and
the semantic information in the final augmented
trees.

• our constraint definition allows for a set of se- 
mantic tags to be matched on a given span. 



4 It is assumed that the constraints for a given sentence are 
consistent, namely there exists at least one parse that meets 
all of them. 

5 This is a consequence of the fact that the SLM generates 
binary trees 



• the two-level layering constraint on semantic 
trees is a structural constraint that is embed- 
ded in the operation of the SLM and thus can 
be guaranteed on test sentences. 

The semantic annotation required by our task is
much simpler than that employed by (Miller et al.,
2000). One possibly beneficial extension of our work
suggested by (Miller et al., 2000) would be to add
semantic tags describing relations between entities
(slots), in which case the semantic constraints would 
not be structured strictly on the two levels used in 
the current approach, respectively frame and slot 
level. However, this would complicate the task of 
data annotation making it more expensive. 

The same constrained EM variant employed for
reestimating the model parameters has been used
by (Pereira and Schabes, 1992) for training a purely
syntactic parser, showing an increase in likelihood but
no improvement in parsing accuracy.

6 Experiments 

We have evaluated the model on manually annotated
data for the MiPad (Huang et al., 2000) task. We
have used 2,239 sentences (27,119 words) for train- 
ing and 1,101 sentences (8,652 words) for test. There 
were 2,239/5,431 semantic frames/slots in the train- 
ing data and 1,101/1,698 in the test data, respec- 
tively. 

The word vocabulary size was 1,035, closed over 
the test data. The slot and frame vocabulary 
sizes were 79 and 3, respectively. The pre-terminal 
(POStag) vocabulary sizes were 64 and 144 for train-
ing steps 2 and 4 (see Section 3), respectively; the
non-terminal (NTtag) vocabulary sizes were 61 and
540 for training steps 2 and 4 (see Section 3), re-
spectively. We have used the NLPwin (Heidorn,
1999) parser to obtain the MiPad syntactic treebank
needed for initializing the SLM at training step 1.

Stage            It   Training Set (perplexity)   Test Set (perplexity)
2 (matched)      0             9.27                      34.81
2 (matched)      1             5.81                      31.25
2 (matched)      2             5.51                      31.41
4 (L-matched)    0             4.71                      24.39
4 (L-matched)    1             4.61                      24.73
4 (L-matched)    2             4.56                      24.88

Table 1: Likelihood Evolution during Training



Although not guaranteed theoretically, the N-best
EM variant used for the SLM parameter reestima-
tion increases the likelihood of the training data
with each iteration when the parser is run in both
matched (training step 2) and L-matched (training
step 4) constrained modes. Table 1 shows the evo-
lution of the training and test data perplexities (cal-
culated using the probability assignment in Eq. (1))
during the constrained training steps 2 and 4.

The training data perplexity decreases monoton- 
ically during both training steps whereas the test 
data perplexity doesn't decrease monotonically in ei- 
ther case. We attribute this discrepancy between the 
evolution of the likelihood on the training and test 
corpora to the different constrained settings for the 
SLM. 

The most important performance measure is the 
slot/frame error rate. To measure it, we use man- 
ually created parses which consist of frame-level la- 
bels and slot-level labels and spans as reference. A 
frame-level error is caused by a frame label of the 
hypothesis parse which is different from the frame 
label of the reference. In order to calculate the slot- 
level errors, we create a set of slot label and slot 
span pairs for the reference and hypothesis parse, 
respectively. The number of slot errors is then the 
minimum edit distance between these 2 sets using 
the substitution, insertion and deletion operations 
on the elements of the set. 
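Because the reference and hypothesis are unordered sets of (slot label, slot span) pairs, the minimum edit distance has a closed form: identical pairs cost nothing, each unmatched reference pair can be paired with an unmatched hypothesis pair as a substitution, and whatever is left over counts as insertions or deletions. The sketch below follows that reading; the function name and the example slots are assumptions for illustration.

```python
from typing import Iterable, Tuple

SlotPair = Tuple[str, Tuple[int, int]]  # (slot label, (start, end) span)


def slot_errors(reference: Iterable[SlotPair], hypothesis: Iterable[SlotPair]) -> int:
    """Minimum number of substitutions, insertions and deletions between two sets."""
    ref, hyp = set(reference), set(hypothesis)
    missed = ref - hyp    # reference pairs the hypothesis did not produce
    spurious = hyp - ref  # hypothesis pairs absent from the reference
    # min(missed, spurious) substitutions plus the remainder as insertions/deletions
    return max(len(missed), len(spurious))


ref = [("Person", (3, 4)), ("Subject", (6, 7)), ("Time", (9, 11))]
hyp = [("Person", (3, 4)), ("Subject", (6, 8)), ("Time", (9, 11)), ("Date", (1, 1))]
print(slot_errors(ref, hyp))
```

In the example the hypothesis gets one span wrong (a substitution) and proposes one extra slot (an insertion), giving two slot errors.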

Table 2 shows the error rate on training and test
data at different stages during training. The last col-
umn of test data results (Test-L1) shows the results
obtained by assuming that the user has specified the 
identity of the frame — and thus the frame level con- 
straint contains only the correct semantic tag. This 
is a plausible scenario if the user has the possibility 
to choose the frame using a different input modality 
such as a stylus. The error rates on the training data 
were calculated by running the model with the same 
constraint as on the test data — constraining the set 
of allowable tags at the frame level. This could be 
seen as an upper bound on the performance of the 
model (since the model parameters were estimated 
on the same data). 

Our model significantly outperforms the baseline 
model — a simple semantic context free grammar 
authored manually for the MiPad task — in terms 
of slot error rate (about 35% relative reduction in 
slot error rate) but it is outperformed by the latter 
in terms of frame error rate. When running the mod- 
els from training step 2 on test data one cannot add 
any constraints; only frame level constraints can be 
used when evaluating the models from training step 
4 on test data. N-best reestimation at either train-
ing stage (2 or 4) doesn't improve the accuracy of
the system, although the results obtained by initial-
izing the model using the reestimated stage 2 model
(iteration 2-{0,1,2} models) tend to be slightly bet-
ter than their 0-{0,1,2} counterparts. Constraining
the frame level tag to have the correct value doesn't
significantly reduce the slot error rate in either ap-
proach, as can be seen from the Test-L1 column of
results [6].
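The relative reduction quoted above can be checked directly from the test slot error rates reported in the tables: 57.36% for the baseline, 37.87% for the model initialized from the MiPad/NLPwin treebank, and 36.93% for the model initialized from the UPenn Treebank. The helper name below is an assumption; the numbers are those of Tables 2 and 3.

```python
def relative_reduction(baseline: float, model: float) -> float:
    """Fraction of the baseline error rate removed by the model."""
    return (baseline - model) / baseline


in_domain = relative_reduction(57.36, 37.87)      # MiPad/NLPwin initialization
out_of_domain = relative_reduction(57.36, 36.93)  # UPenn Treebank initialization
print(f"{100 * in_domain:.1f}% {100 * out_of_domain:.1f}%")
```

Both values fall between 34% and 36%, consistent with the "about 35%" figure. The same computation on the test frame error rates (14.90% baseline against 21.62%) gives a negative value, reflecting that the baseline is better at the frame level.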



6.1 Out-of-domain Initial Statistics 



Recent results (Chelba, 2001) on the portability of 
syntactic structure within the SLM framework show 
that it is possible to initialize the SLM parameters 
from a treebank for out-of-domain text and main- 
tain the same language modeling performance. We 
have repeated the experiment in the context of in- 
formation extraction. 

Similar to the approach in (Miller et al., 2000), we
initialized the SLM statistics from the UPenn Tree-
bank parse trees (about 1M words of training data) at
the first training stage, see Section 3. The remain-
ing part of the training procedure was the same as
in the previous set of experiments.

The word, slot and frame vocabulary were the 
same as in the previous set of experiments. The 
pre-terminal (POStag) vocabulary sizes were 40 and
204 for training steps 2 and 4 (see Section 3), respec-
tively; the non-terminal (NTtag) vocabulary sizes
were 52 and 434 for training steps 2 and 4 (see Sec-
tion 3), respectively.

The results are presented in Table 3, showing im-
proved performance over the model initialized from
in-domain parse trees. The frame accuracy increases 
substantially, almost matching that of the baseline 
model, while the slot accuracy is just slightly in- 
creased. We attribute the improved performance of 
the model initialized from the UPenn Treebank to 
the fact that the model explores a more diverse set 
of trees for a given sentence than the model initial- 
ized from the MiPad automatic treebank generated 
using the NLPwin parser. 

6.2 Impact of Training Data Size on 
Performance 

We have also evaluated the impact of the training 
data size on the model performance. The results are 
presented in Table 4, showing a strong dependence
of both the slot and frame error rates on the amount 
of training data used. This, together with the high 
accuracy of the model on training data (see Table 3),
suggests that we are far from saturation in perfor- 
mance and that more training data is very likely to 
improve the model performance substantially. 

6.3 Error Trends 

As a summary error analysis, we have investigated 
the correlation between the semantic frame/slot er- 
ror rate and the number of semantic slots in a sen- 
tence. We have binned the sentences in the test set 
according to the number of slots in the manual an- 



6 The frame error rate in this column should be 0; in prac- 
tice this doesn't happen because some test sentences could 
not be L-match parsed using the pruning strategy employed 
by the SLM, see Section 4.3.

   Training It                      Error Rate (%)
                         Training          Test          Test-L1
Stage 2    Stage 4     Slot   Frame    Slot   Frame    Slot   Frame
Baseline               43.41   7.20    57.36  14.90    57.30   6.90
0          0            9.78   1.65    37.87  21.62    37.46   0.64
0          1           10.36   1.20    39.16  21.80    38.28   0.64
0          2            9.42   1.05    39.75  22.25    38.63   0.82
2          0            8.92   1.25    38.04  22.07    37.81   0.91
2          1            9.01   0.95    37.51  21.89    37.28   0.91
2          2            9.47   0.90    38.99  21.89    38.57   0.82

Table 2: Training and Test Data Slot/Frame Error Rates



       Training It                         Error Rate (%)
                                Training          Test          Test-L1
Stage 2            Stage 4    Slot   Frame    Slot   Frame    Slot   Frame
Baseline                      43.41   7.20    57.36  14.90    57.30   6.90
0, MiPad/NLPwin    0           9.78   1.65    37.87  21.62    37.46   0.64
1, UPenn Trbnk     0           8.44   2.10    36.93  16.08    36.34   0.91
1, UPenn Trbnk     1           7.82   1.70    36.98  16.80    36.22   0.82
1, UPenn Trbnk     2           7.69   1.50    36.98  16.80    36.22   1.00

Table 3: Training and Test Data Slot/Frame Error Rates, UPenn Treebank initial statistics



Training         Training It                          Error Rate (%)
Corpus                                       Training          Test          Test-L1
Size       Stage 2           Stage 4       Slot   Frame    Slot   Frame    Slot   Frame
Baseline                                   43.41   7.20    57.36  14.90    57.30   6.90
all        1, UPenn Trbnk    0              8.44   2.10    36.93  16.08    36.34   0.91
1/2 all    1, UPenn Trbnk    0               -      -      43.76  18.44    43.40   0.45
1/4 all    1, UPenn Trbnk    0               -      -      49.47  22.98    49.53   1.82

Table 4: Performance Degradation with Training Data Size



notation and evaluated the frame/slot error rate in 
each bin. The results are shown in Table 5.

The frame/slot accuracy increases with the num-
ber of slots per sentence (except for the 5+ bin,
where the frame error rate increases), showing that
slot co-occurrence statistics improve performance; sen-
tences containing more semantic slots tend to be less
ambiguous from an information extraction point of
view.





                    Error Rate (%)
No. slots/sent     Slot     Frame     No. Sent
1                  43.97    18.01     755
2                  39.23    16.27     209
3                  26.44     5.17      58
4                  26.50     4.00      50
5+                 21.19     6.90      29

Table 5: Frame/Slot Error Rate versus Slot Density
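Since the data contain one frame per sentence, the per-bin frame error rates of Table 5, weighted by the number of sentences in each bin, should recover an overall test frame error rate. The short check below performs that computation; the variable names are assumptions, and the reading that Table 5 was produced with the UPenn Treebank initialized model is an inference from the numbers, not something the text states.

```python
# (number of sentences, frame error rate in %) for the bins 1, 2, 3, 4 and 5+ of Table 5
bins = [(755, 18.01), (209, 16.27), (58, 5.17), (50, 4.00), (29, 6.90)]

total_sentences = sum(n for n, _ in bins)
overall_frame_error = sum(n * rate for n, rate in bins) / total_sentences
print(total_sentences, round(overall_frame_error, 2))
```

The bins add up to the 1,101 test sentences, and the weighted average is about 16.07%, which agrees up to rounding with the 16.08% test frame error rate in the first UPenn Treebank row of Table 3.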



7 Conclusions and Future Directions 

We have presented a data-driven approach to infor- 
mation extraction that, despite the small amount 
of training data used, is shown to outperform the 
slot level accuracy of a simple semantic grammar 
authored manually for the MiPad — personal infor- 
mation management — task. 

The performance of the baseline model could be 
improved with more authoring effort, although this 
is expensive. 

The big difference in performance between train-
ing and test, together with the fact that we are using so
little training data, makes improvements from using
more training data very likely, although this may
be expensive. A framework which utilizes the vast
amounts of text data collected once such a system
is deployed would be desirable. Statistical model-
ing techniques that make more effective use of the
training data should be used in the SLM, maximum
entropy (Berger et al., 1996) being a good candidate.

As for using the SLM as the language understand-
ing component of a speech driven application, such
as MiPad, it would be interesting to evaluate the im-
pact of incorporating the semantic constraints on the
word-level accuracy of the system. Another possible
research direction is to modify the framework such
that it finds the most likely semantic parse given
the acoustics, thus treating the word sequence as
a hidden variable.